Differentiating Unimodal and Multimodal Distributions in Pulsed Dipolar Spectroscopy Using Wavelet Transforms

Site-directed spin labeling has enabled protein structure determination using electron spin resonance (ESR) pulsed dipolar spectroscopy (PDS). Small details in a distance distribution can be key to understanding important protein structure-function relationships. A major challenge has been to differentiate unimodal from overlapped multimodal distance distributions, because they often yield similar reconstructed distributions and dipolar signals. Current model-free distance reconstruction techniques such as Srivastava-Freed Singular Value Decomposition (SF-SVD) and Tikhonov regularization can bury such small features within uncertainty and/or error bounds, even when the features are genuinely present. In this work, we demonstrate that the continuous wavelet transform (CWT) can distinguish PDS signals arising from unimodal and multimodal distance distributions. We show that periodicity in the CWT representation is characteristic of unimodal distributions and is masked in multimodal cases. This work is meant as a precursor to a cross-validation technique that could indicate the modality of the distance distribution.


Introduction
Protein structure determination remains one of the most challenging and open research subjects. Site-directed spin labeling (SDSL) [1][2][3][4] in combination with pulsed dipolar spectroscopy (PDS) has enabled protein structure determination using electron spin resonance (ESR) spectroscopy [5][6][7][8][9][10]. Typically, a pair of spin probes is attached to the domain of interest in a protein using SDSL. The dipolar coupling between spin probes separated by a distance r is inversely proportional to $r^3$. Thus, measuring the dipolar coupling by PDS yields inter-spin distance information. Such distances can resolve aspects of protein structures directly and serve as crucial constraints in structural studies.
Proteins are highly dynamic entities and their conformational ensembles give rise to distance distributions between the spin pairs, P(r), rather than a single value of r, as shown in Figure 1. The process of deriving P(r) from a PDS time domain signal, S(t), is an ill-posed problem [11,12]. In general, a PDS signal can be expressed as

$$S(t) = \int \kappa(t, r)\, P(r)\, dr \qquad (1)$$

where κ(t, r) is the kernel that depends on t and r, averaged over the angle θ between the inter-spin dipolar vector and the direction of the external magnetic field. For all PDS techniques, κ(t, r) is singular and therefore Equation 1 cannot be solved for P(r) by a simple inversion of κ(t, r). Various techniques have been proposed over the years to derive P(r) from PDS signals, including model-free [13][14][15][16][17][18], model-based [19][20][21] and training-based methods [22,23]. In model-free approaches, such as Tikhonov regularization (TIKR) [13] and SF-SVD [16,17], distance distributions are heavily reliant on the PDS time domain signals, as they operate independently of a priori information. Because the problem of determining P(r) is ill-posed, the solutions carry uncertainties, especially when P(r) contains weak and/or shoulder peaks. In such cases, there is often no way to confirm whether the solutions truly represent P(r) or are driven by artifacts.

Fig. 1 PDS signals capture the dipolar interaction between a pair of spin labels attached to a protein molecule. Post processing of the signal yields the distance distribution, P(r), between the spin pair.

Major Challenge: Reconstruction of Small Details in P(r)
The kernel for DEER is given by

$$\kappa(t, r) = \int_0^1 \cos\!\left[\left(1 - 3\cos^2\theta\right) \frac{a\, t}{r^3}\right] d(\cos\theta) \qquad (2)$$

where the dipolar constant is $a = \mu_0 \mu_B^2 g_e^2 / 2h$, $\mu_0$ is the magnetic constant, $\mu_B$ the Bohr magneton, $g_e$ the free-electron g value and h the Planck constant. The DEER signal in its discrete form can be written as

$$S = K P \qquad (3)$$

where K and P are the kernel matrix and the distance distribution vector, respectively. Note that the expression given in Equation 3 corresponds to the DEER signal originating from the interaction of an isolated pair of spin-1/2 particles. In a standard DEER experiment, the inter-molecular signal (or background) must be removed first.
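The discrete relation between the signal, the kernel matrix and the distance distribution can be illustrated numerically. The following Python snippet is a minimal sketch, assuming a powder average over cos θ approximated by a midpoint rule, a background-free signal, and an arbitrary single-Gaussian P(r) on a 2 to 6 nm grid. The function name `deer_kernel`, the grids and the Gaussian parameters are hypothetical choices for illustration and are not values used in this work.

```python
import numpy as np

# CODATA constants in SI units
MU_0 = 1.25663706212e-6   # magnetic constant, N A^-2
MU_B = 9.2740100783e-24   # Bohr magneton, J T^-1
G_E = 2.00231930436       # free-electron g value
H = 6.62607015e-34        # Planck constant, J s

# Dipolar constant a = mu_0 mu_B^2 g_e^2 / (2 h), in rad s^-1 m^3
A_DIP = MU_0 * MU_B**2 * G_E**2 / (2.0 * H)


def deer_kernel(t_us, r_nm, n_theta=501):
    """Return K[i, j], the powder-averaged dipolar kernel at time t_i and distance r_j."""
    t = np.asarray(t_us, dtype=float) * 1e-6   # microseconds -> seconds
    r = np.asarray(r_nm, dtype=float) * 1e-9   # nanometres -> metres
    # Midpoint grid in z = cos(theta); uniform weights give the powder average
    z = (np.arange(n_theta) + 0.5) / n_theta
    omega = A_DIP / r**3                       # dipolar angular frequency per distance
    phase = (1.0 - 3.0 * z[:, None, None] ** 2) * t[None, :, None] * omega[None, None, :]
    return np.cos(phase).mean(axis=0)          # shape (len(t), len(r))


# Illustrative grids (assumed values, not taken from the paper)
t_us = np.linspace(0.0, 3.0, 128)
r_nm = np.linspace(2.0, 6.0, 64)
K = deer_kernel(t_us, r_nm)

# Illustrative unimodal P(r): Gaussian centred at 4 nm, normalised to unit sum
P = np.exp(-0.5 * ((r_nm - 4.0) / 0.3) ** 2)
P /= P.sum()

S = K @ P                      # discrete DEER signal, S = K P
cond_K = np.linalg.cond(K)     # a large value reflects the ill-posed inversion
```

The condition number `cond_K` gives a rough numerical sense of why P cannot be recovered by simply inverting K: small perturbations of S are strongly amplified in any naive inverse.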
The P(r) used in the simulations are shown in Figure 2. All the distance distributions in model-I (top row) were produced by mixing different Gaussian distributions (shaded area). For model-II (bottom row), both Gaussian and Cauchy distributions (defined in Equation 4) were mixed to produce the distance distributions. The P(r) within each model are designed to be very similar, with only minor differences. Such small differences in P(r) can be key to understanding protein structure-function relationships and structural changes, yet it is often challenging to reconstruct them with confidence. The DEER signals for the model-I and model-II distance distributions are shown in Figure 3. Visual inspection of the DEER time traces reveals hardly any differences, whereas the differences in the distance distributions are visible in the overlapped plots of the scaled P(r) (the left panel of Figure 3). Reconstructions by SF-SVD [16,17] and the DEERLab TIKR method [24] are compared with the model distance distributions in Figures 4 and 5. While both methods captured the major parts of the distance distribution patterns, the solutions differed significantly in some cases, e.g., (I.C-D) and (II.G-H) in Figure 4 and (II.B-D) and (II.F-H) in Figure 5. This raises considerable doubt about the true nature of P(r) in such cases. The shaded (gray) regions in those figures represent the uncertainty (SF-SVD) and the 50% and 95% confidence intervals (DEERLab TIKR). The uncertainty of the SF-SVD solutions is much smaller than that of the TIKR solutions, especially for the multimodal distributions. In both cases, but mainly for TIKR, the uncertainty is greater in regions near the minor peak positions of the model P(r). More importantly, the 95% confidence intervals of the TIKR solutions, especially for model-II, indicate large uncertainty in those solutions.
At present, no cross-validation method exists to confirm the existence of multimodal distance distributions with one or more minor (or shoulder) peaks. Such a method is necessary to improve the robustness of PDS analysis.

Proposed Method
Time-frequency analysis [25][26][27][28][29][30] is a reliable method to decouple a signal into its distinct constituent components by projecting it onto the time-frequency plane. The Short Time Fourier Transform (STFT) [25,26,29] is one strategy for such analysis, but its fixed window makes it unsuitable for separating overlapping signal components. The wavelet transform is a powerful and flexible method for time-frequency analysis and is therefore extremely useful for extracting localized information from various types of signals [31,32]. We propose the application of continuous wavelet transforms (CWT) in time-frequency analysis to (1) identify differences in P(r) for practically identical PDS signals and (2) confirm the existence of multimodal P(r).

Wavelet Transform
A wavelet transform (WT) can simultaneously represent the time-frequency information for analysis through signal localization and is defined as [33][34][35]

$$F(\tau, s) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t - \tau}{s}\right) dt \qquad (5)$$

where s is the inverse frequency (or frequency range) observing parameter (also called the scale parameter), τ is the signal localization parameter (also called the translation parameter), t represents the signal location, f(t) is the signal, F(τ, s) is the wavelet-transformed signal at a given signal localization and frequency, and $\psi^{*}\!\left(\frac{t-\tau}{s}\right)$ is the signal probing function obtained from a function called the "wavelet", ψ(t). The functions ψ(t) and $\psi^{*}\!\left(\frac{t-\tau}{s}\right)$ are commonly referred to as the "mother" and "daughter" wavelets, respectively, because the latter is derived from ψ(t). ψ*(t) is the complex conjugate of ψ(t); for a real function the two are identical, ψ*(t) = ψ(t).
The Fourier Transform (FT) can be considered as a special limiting case of the WT wherein $s \to (-i\omega)^{-1}$, $\tau \to 0$, and $\psi \to e^{-i\omega t}$. Whereas a FT integrates out the time dependence, the WT is a function of both frequency, $s^{-1}$, and time, τ, and thus can display correlations in the signal between them.
Unlike the STFT, the WT employs a variable window width and a frequency parameter incorporated in the wavelet, which allows variation in both signal location (e.g., time) and frequency. This reveals where a particular frequency occurs in the signal domain and identifies all frequencies that are present at a particular signal location or interval. The signal is thereby analyzed into different frequencies at different resolutions, which is known as "multiresolution analysis".
The wavelet-transformed signal F(τ, s) is represented in the signal domain at a frequency or frequency range, unlike the FT and STFT, which represent the signal just in the frequency domain. The location of data points in the wavelet domain is spatially correlated with their location in the signal domain. This reveals how a signal looks when observed from a specific frequency or frequency range.
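The signal is reconstructed by the inverse WT, which is given as

$$f(t) = \frac{1}{C_\psi} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} F(\tau, s)\, \frac{1}{\sqrt{|s|}}\, \psi\!\left(\frac{t - \tau}{s}\right) \frac{d\tau\, ds}{s^2} \qquad (6)$$

where $C_\psi$ is the admissibility constant, which is written as

$$C_\psi = \int_{-\infty}^{+\infty} \frac{|\Psi(\omega)|^2}{|\omega|}\, d\omega < \infty \qquad (7)$$

where Ψ(ω) is the FT of the wavelet function ψ(t). The constraint in Equation 7 implies that the wavelet function ψ(t) must be oscillatory with zero mean, i.e., $\int_{-\infty}^{+\infty} \psi(t)\, dt = 0$.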
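The zero-mean requirement can be checked numerically for a concrete wavelet. The snippet below is a small sketch, assuming an unnormalized Mexican hat wavelet (the negative second derivative of a Gaussian) and a simple Riemann sum on a finite grid; the grid limits and step are illustrative assumptions.

```python
import numpy as np


def mexican_hat(t):
    """Unnormalized Mexican hat wavelet: negative second derivative of a Gaussian."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)


# Finite grid wide enough that the wavelet has decayed to numerical zero at the ends
t = np.linspace(-10.0, 10.0, 20001)
dt = t[1] - t[0]
psi = mexican_hat(t)

mean_integral = psi.sum() * dt      # approximates the integral of psi(t) dt
energy = (psi**2).sum() * dt        # approximates the integral of psi(t)^2 dt
```

Here `mean_integral` vanishes to numerical precision while `energy` stays finite, which is the combination of properties expected of an admissible, oscillatory wavelet.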

Discretized Continuous Wavelet Transform (CWT)
As with the Fourier Transform, the continuous form of the WT in Equation 5 cannot be applied directly to discrete data, so a discretized version of the CWT is used. For practical purposes, the translation parameter and the scale parameter are discretized as τ = a and s = b, with a and b both being integers. The CWT of a discrete input signal is defined as

$$C[a, b] = \frac{1}{\sqrt{|b|}} \sum_{m} f[t_m]\, \psi^{*}\!\left(\frac{t_m - a}{b}\right) \qquad (8)$$

where C[a, b] is the wavelet-transformed signal and f[t_m] is the discrete input signal. It should be noted that the discrete wavelet transform (DWT) is computationally more efficient than the CWT and is applied more frequently. However, the DWT is more appropriate for extracting specific information from a signal. The CWT, on the other hand, is better suited to scanning all the time-frequency components in a signal for finer details, and hence it is better suited for this work.
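The discretized transform can be written as a direct sum over samples. The following Python sketch assumes integer sample indices for t_m, translations a on the same indices, the common 1/sqrt(|b|) normalization, and a real Mexican hat wavelet as a stand-in; the names `cwt_direct` and `mexican_hat` are hypothetical, and practical libraries use faster convolution-based implementations.

```python
import numpy as np


def mexican_hat(x):
    """Real wavelet used for illustration: negative second derivative of a Gaussian."""
    return (1.0 - x**2) * np.exp(-0.5 * x**2)


def cwt_direct(f, scales, wavelet=mexican_hat):
    """Evaluate C[a, b] = |b|**-0.5 * sum_m f[m] * wavelet((m - a) / b).

    Samples are indexed by integers m, translations a run over the same indices,
    and the wavelet is assumed real, so the complex conjugate is omitted.
    Returns an array of shape (len(scales), len(f)).
    """
    f = np.asarray(f, dtype=float)
    m = np.arange(f.size)
    shift = m[None, :] - m[:, None]           # shift[a, m] = m - a
    coeffs = np.empty((len(scales), f.size))
    for i, b in enumerate(scales):
        daughters = wavelet(shift / b) / np.sqrt(abs(b))
        coeffs[i] = daughters @ f             # sum over m for every translation a
    return coeffs


# Toy input: a cosine with a period of 32 samples
n = 512
signal = np.cos(2.0 * np.pi * np.arange(n) / 32.0)
scales = np.arange(1, 21)
C = cwt_direct(signal, scales)

# Scale with the largest response at the centre of the trace
best_scale = scales[np.argmax(np.abs(C[:, n // 2]))]
```

For this toy cosine the strongest response occurs at a scale of about 8 samples, illustrating how the scale axis acts as an inverse-frequency axis.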

CWT Time-Frequency Analysis in Python
Time-frequency analysis decouples a signal into its distinct constituent components by projecting it onto the time-frequency plane. In this work, we performed the CWT time-frequency analysis of PDS signals using a Python script.
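A minimal sketch of such an analysis is given below. It assumes the PyWavelets package (`pywt`) and its `gaus2` wavelet, which is an assumption about the tooling rather than something stated in the text. The helper name `cwt_time_frequency`, the sampling period, the scale range and the damped-cosine stand-in trace are all hypothetical and do not reproduce the simulated DEER signals.

```python
import numpy as np
import pywt  # PyWavelets (assumed library)


def cwt_time_frequency(signal, dt, scales, wavelet="gaus2"):
    """Return CWT coefficients and the pseudo-frequency associated with each scale."""
    coeffs, freqs = pywt.cwt(signal, scales, wavelet, sampling_period=dt)
    return coeffs, freqs


# Stand-in trace: a damped cosine (NOT a simulated DEER signal)
dt = 0.016                                   # sampling period in microseconds (assumed)
t = np.arange(256) * dt
trace = np.exp(-t / 2.0) * np.cos(2.0 * np.pi * 1.5 * t)

scales = np.arange(1, 65)                    # integer scales (assumed range)
coeffs, freqs = cwt_time_frequency(trace, dt, scales)
power = np.abs(coeffs)                       # magnitude map suitable for a contour plot
```

The array `power` has one row per scale and one column per time point, so it can be drawn as a contour plot against `t` and `freqs`.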

Time-Frequency Analysis of PDS Signals
We calculated the CWT for the simulated DEER traces and plotted the results in Figure 6 and Figure 7. Both figures show minor but clearly visible differences among the time-frequency plots, indicating that the distance distributions are strongly similar yet detectably different. Thus, the time-frequency analysis reveals significant information about different samples prior to the P(r) reconstruction process. Conversely, such differences between nominally identical samples could indicate artifacts, reproducibility issues, or inconsistency in sample preparation.

Time-Frequency Analysis and the Modality of the Distance Distributions
For a qualitative analysis of the correlation between the differences in P(r) and the corresponding time-frequency contour plots, we have plotted the P(r) and DEER trace components along with their time-frequency plots for P(r)-I.D in Figure 8 (top four rows). The CWT time-frequency contour plots show that both the frequency and the pattern along the time domain vary with the modal distance of the distribution as well as with its width. While the time-frequency plot of the summed signal closely resembles the top-row plot, indicating the dominance of the corresponding P(r) component in the mixture, it also shows clear differences. The time-frequency plots of the unimodal distributions have more prominent and periodic features than that of the summed signal. In multimodal and overlapping distance distributions, such periodic patterns tend to cancel out. Thus, a time-frequency plot with truncated features suggests the presence of multimodal and overlapped distance distributions. The same observations hold for model-II in Figure 9. The loss of features increases with the number of closely spaced modal distances present in a distribution. Hence, it may be possible to train machine learning clustering algorithms on a large dataset of model distance distributions and their time-frequency patterns, which could then indicate the number of modal distances in a distribution.
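One property that underlies this comparison is that the CWT is linear: the coefficients of a summed signal equal the sum of the coefficients of its components, so signed features of individual components superpose and can partially cancel. The sketch below checks this numerically, assuming a real Mexican hat wavelet and two damped cosines as stand-ins for the component traces; the function `cwt_mexh` and all signal parameters are hypothetical. Note that the identity holds for the signed coefficients, not for their magnitudes.

```python
import numpy as np


def cwt_mexh(f, scales):
    """CWT with a real Mexican hat wavelet, one np.convolve call per scale."""
    rows = []
    for b in scales:
        half = int(np.ceil(8 * b))            # truncate the wavelet at 8 scale widths
        x = np.arange(-half, half + 1) / b
        psi = (1.0 - x**2) * np.exp(-0.5 * x**2) / np.sqrt(b)
        # psi is real and symmetric, so convolution equals correlation here;
        # the signal must be longer than the widest truncated wavelet
        rows.append(np.convolve(f, psi, mode="same"))
    return np.array(rows)


# Two stand-in components with nearby frequencies (assumed values)
t = np.arange(512) * 0.008                    # time axis in microseconds
comp_1 = np.exp(-t / 1.5) * np.cos(2.0 * np.pi * 1.0 * t)
comp_2 = 0.4 * np.exp(-t / 1.5) * np.cos(2.0 * np.pi * 1.6 * t)

scales = np.arange(2, 21)
sum_of_cwts = cwt_mexh(comp_1, scales) + cwt_mexh(comp_2, scales)
cwt_of_sum = cwt_mexh(comp_1 + comp_2, scales)

max_dev = np.max(np.abs(sum_of_cwts - cwt_of_sum))   # zero up to rounding error
```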

Time-Frequency Analysis Using Different Wavelets
In Figure 10, we repeated the time-frequency analysis for the model-I DEER time domain signals using Gaussian-4 ('Gaus4') and Mexican Hat ('Mexh') wavelets and plotted the results along with those of the 'Gaus2' analysis shown previously. Small variations between the CWT spectra reveal the different time-frequency sensitivities of the wavelets, but the core features of the CWT spectral pattern remain the same. Hence, it can be seen that the CWT spectra obtained for the dipolar signal are largely independent of the type of wavelet used, demonstrating the robustness of the analysis. We found that among the wavelet families available for CWT time-frequency analysis at present, the Gaussian family is the most suited for the analysis performed in this work. It is worth mentioning that other standard wavelets, such as Coiflet and Daubechies, are not available for CWT analysis in Python or MATLAB at present. Therefore, we plan to develop new wavelets for deeper time-frequency analysis of PDS signals in the near future.
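Which wavelets a given library exposes for the CWT can be queried directly. The snippet below is a short sketch assuming the PyWavelets package, which is an assumption about the Python tooling and is not named in the text; the lowercase names `gaus2`, `gaus4` and `mexh` are the PyWavelets spellings assumed to correspond to the wavelets discussed above.

```python
import pywt  # PyWavelets (assumed library)

continuous = pywt.wavelist(kind="continuous")   # wavelets usable with pywt.cwt
discrete = pywt.wavelist(kind="discrete")       # wavelets intended for the DWT

# Wavelet names used for the CWT in the text, in PyWavelets spelling (assumed mapping)
cwt_names = ["gaus2", "gaus4", "mexh"]
available_for_cwt = {name: name in continuous for name in cwt_names}

# Daubechies ('db4') and Coiflet ('coif1') examples: listed as discrete only
dwt_only = {
    name: (name in discrete and name not in continuous)
    for name in ["db4", "coif1"]
}
```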

Conclusion
Through SDSL, spin labels are attached to specific domains of a protein, and the application of PDS then yields targeted structural information, i.e., distance distributions between the spin probes. The derivation of distance distributions is an ill-posed problem and, in many cases, the results vary with the reconstruction method used in the analysis. In such cases, it is important to cross-validate the results and propose a solution with the least uncertainty. We illustrated in this work that continuous wavelet transform-based time-frequency analysis could be used for distinguishing unimodal and multimodal distance distributions. We used eight model distance distributions and compared the solutions obtained from the SF-SVD and DEERLab Tikhonov regularization methods to illustrate the issue. The CWT time-frequency analysis reliably distinguishes between such very similar PDS signals, indicating the presence of unimodal vs. multimodal distance distributions. This method could be further developed for the analysis of PDS signals to cross-validate derived distance distributions and reduce the uncertainty associated with such analysis. In addition to model-free methods that generate distance distributions from dipolar signals, this method could potentially also be used with training-based P(r) reconstruction methods.
Future work will include the development of Coiflet and Daubechies wavelets for the CWT. Additionally, we plan to analyze a large dataset of PDS signals and employ appropriate machine learning tools to quantify the correlation between CWT time-frequency patterns and the number of peaks in the distance distributions.